UNIQUE IDENTIFIER FOR BIOLOGICAL SAMPLES



RELATED APPLICATION(S)

This application claims priority to application 60/076,081, filed February 26, 
1998, the entire teachings of which are incorporated herein by reference. 

BACKGROUND OF THE INVENTION

With the advent of the Human Genome Project and the advances in
technology that have resulted, biological and genetic testing have become
increasingly common. Hospitals and other health care entities are using new
tests for diseases, and are processing more samples for testing than ever before. The
ease and speed of many biological tests have also increased enormously, so that these
tests are now being widely used outside of the health care industry. Veterinarians,
of course, have always closely followed advances in human health care. But law
enforcement agencies now routinely employ DNA-based methods in forensics, and
even population geneticists, ecologists, and evolutionary biologists use these
methods to track the evolution and variability within and between populations of
organisms.

When handling large numbers of samples, accurate and reliable tracking of
samples and quality control of associated information is vital. In hospital settings,
aberrant test results are always a cause for concern because doubts are then cast on
the state of the patient's health. In a tissue repository, it must be possible for a
sample (or portions of a sample) to be reliably and repeatedly retrieved with no
doubts as to the sample's identity. Mislabeling or loss of labeling of a sample may
mean that the sample is rendered useless if it cannot be accurately connected back to
the sample's history and/or source. Most samples and their sources are given a
common alphanumeric designation, and this designation is also linked to
information about the source and the sample (e.g., patient name, sample type,
disease condition, etc.). A loss of this designation from its association with either
the source, or the sample, or the information will often result in a complete loss of
utility of all three.

This potential loss of association between the designation and the sample is
especially likely in settings where very large numbers of samples are being
processed. Machine errors, while problematic, generally result in the destruction of
large numbers of samples, and so are noticed easily. Human error, however, has the
potential to cause serious errors that go unnoticed for a period of time. These
include transcription errors, misplacing or swapping of samples, destruction of
labels, and off-by-one errors (resulting in a series of samples where the designation or
information from each sample is misassociated with the next sample). In addition,
pages from lab notebooks can be obliterated or lost, and magnetic media corrupted.
Databases containing all of this information can be backed up, but intervening data
added to the database since the last backup are usually lost. If an error is introduced
and not discovered until after a backup is made, then this error effectively replaces
the "true" data. In addition, many facilities save only the most recent backup, or
store backups at the same site as the current data, resulting in loss of all information
in the event of a physical disaster (e.g., a fire).

SUMMARY OF THE INVENTION 

The present invention relates to a method of creating a unique identifier for
reliably identifying samples, their sources, and associated information. The use of
the identification system described herein substantially decreases potential mixups
and misidentification of samples, their sources, and associated information.

Specifically, the present invention provides a method for creating a unique
identifier which is used to label the sample, its source, or the associated information,
based on the polymorphisms inherent in the sample and its source. One or more
polymorphisms in the sample are detected, and the resulting polymorphism data are
used to produce a unique identifier, which is then used to identify the sample. This
unique identifier can also be linked with the source, and/or any information that may
be associated with either the sample or the source (i.e., the unique identifier can be
used as a common designation for the sample, its source and/or other relevant
information). If this unique identifier is separated from the sample, then the
polymorphisms within the sample simply need to be re-detected to reproduce the
polymorphism data which are then used to produce the unique identifier, thereby
recreating the proper unique identifier, and, ultimately, its link to its source.

In general, the invention features a method for producing a unique identifier
for a biological sample, comprising detecting one or more polymorphisms within the
biological sample, and selecting one or more polymorphisms sufficient to form a
unique identifier. The biological sample can be from a vertebrate, an invertebrate, a
plant, or consist of microorganisms. The biological sample can also be from a
mammal, particularly a human. The sample can be blood, saliva, hair, body fluid,
tissues, organs, one or more cells, or a whole organism. The polymorphisms can be
nucleic acid polymorphisms, protein polymorphisms, enzyme polymorphisms,
chemical polymorphisms, biochemical polymorphisms, phenotypic polymorphisms,
and quantitative polymorphisms, particularly a nucleic acid sequence polymorphism,
a nucleic acid length polymorphism, or a short tandem repeat (STR). The unique
identifier can also be linked to the source of the biological sample, or relevant
information about the biological sample or the source of the biological sample. The
unique identifier can be in the form of an alphanumeric string, or a bar code.

The invention also features a method for establishing a repository containing
a collection of biological samples, comprising obtaining a biological sample from a
source, detecting one or more polymorphisms in the sample, selecting one or more
polymorphisms sufficient to form a unique identifier, using the unique identifier to
identify the sample, storing the sample with the unique identifier, and repeating
these steps for biological samples from other sources. The samples, in general, are
DNA-containing samples, particularly from humans, and the polymorphisms are
nucleic acid polymorphisms, protein polymorphisms, enzyme polymorphisms,
chemical polymorphisms, biochemical polymorphisms, phenotypic polymorphisms,
and quantitative polymorphisms, or short tandem repeats (STRs). The unique
identifier can be in the form of an alphanumeric string, or a bar code, and can also be
linked to the source of the biological sample, or relevant information about the
biological sample or the source of the biological sample.

In addition, the invention features a method of determining, by means of a
unique identifier, if a source is represented by a sample within the repository,
comprising obtaining a sample from the source, detecting one or more
polymorphisms in the sample, selecting one or more polymorphisms sufficient to
form a unique identifier, and comparing the unique identifier so produced to the
unique identifier of each sample in the repository, where shared identity between the
two unique identifiers indicates that the source is already represented in the
repository. In general, the samples are DNA-containing samples, preferably from
humans. The polymorphisms are nucleic acid polymorphisms, particularly short
tandem repeats (STRs), protein polymorphisms, enzyme polymorphisms, chemical
polymorphisms, biochemical polymorphisms, phenotypic polymorphisms, and
quantitative polymorphisms. The unique identifier can also be linked to the source
of the biological sample, or relevant information about the biological sample or the
source of the biological sample. The unique identifier can be in the form of an
alphanumeric string, or a bar code.

The invention also features a method for linking, by means of a unique
identifier, a first biological object lacking a unique identifier with a second object
having a unique identifier, comprising detecting one or more polymorphisms in the
first biological object, selecting one or more polymorphisms sufficient to form a
unique identifier, and comparing the unique identifier so made to the unique
identifier of the second object, where shared identity between the two unique
identifiers links the first biological object with the second object. The biological
sample can be from a vertebrate, an invertebrate, a plant, or consist of
microorganisms. The biological sample can also be from a mammal, particularly a
human. The sample can be blood, saliva, hair, body fluid, tissues, organs, one or
more cells, or a whole organism. The polymorphisms can be nucleic acid
polymorphisms, particularly short tandem repeats (STRs), protein polymorphisms,
enzyme polymorphisms, chemical polymorphisms, biochemical polymorphisms,
phenotypic polymorphisms, and quantitative polymorphisms, particularly a nucleic
acid sequence polymorphism, or a nucleic acid length polymorphism. The unique
identifier can also be linked to the source of the biological sample, or relevant
information about the biological sample or the source of the biological sample. The
unique identifier can be in the form of an alphanumeric string, or a bar code.

DETAILED DESCRIPTION OF THE INVENTION 

The present invention provides a method for creating a unique identifier for
identifying biological samples, their sources, or associated information based on the
polymorphisms inherent in the sample and source themselves. In this method, the
nucleic acid contained within the sample itself is used to produce the unique
identifier for identifying and linking the sample, its source, and associated
information. No external material needs to be added to the sample which could
dilute or alter the accuracy of other test results. An advantage of the invention,
therefore, is that it is unnecessary to add "identifying sequences" to the samples, and
that without such additions, one may conduct studies of genetics, disease
associations, evolutionary relationships, etc., without the results being tainted by the
added identifying sequences.

A "source" or "the source from which the sample is derived" refers to the
originating material for a sample. A source of a biological sample, for example, can
be a human, any animal, plant, insect, or a population or strain of microorganisms.
A source of a biological sample does not have to be living, and can be a deposit in a
tissue repository, herbaria or museum specimens, forensic specimens, or fossils. A
"potential source" as used herein, means a source from which the sample may
possibly have been taken in the past.

By "sample" is meant a portion of source biological material that originated
elsewhere, i.e., the sample was removed from its source. A sample can be any
biological sample (e.g., blood, saliva, hair, organs, biopsies, bodily fluids, one or
more cells), and can be taken from any vertebrate, including mammals such as
humans, or from any plant, insect, or reptile. The sample can also be a strain or mixed
population of microbes. Samples can also be biological materials taken from
defunct or extinct organisms, e.g., samples can be taken from pressed plants in
herbarium collections, or from pelts, taxidermy displays, fossils, or other materials
in museum collections.

"Information associated with" the sample or source from which the sample is
derived is meant to include, without limitation, any information that might be
necessary or advantageous to be linked to the sample or the sample source, e.g.,
name, address, sex, medical history (in the case of human samples), species,
collection data, provenance (in the case of non-human samples), etc.

Once a biological sample is taken from its source, the sample is tested across
one or more polymorphic loci, and the polymorphic data produced are used to create
a unique identifier, which is identifiably linked to the sample, and serves as its
unique designation. This unique identifier can also be identifiably linked to the
sample's source, and/or any information that may exist concerning the source and/or
the sample. Saying that the unique identifier is "identifiably linked" to the
sample, sample source, or related information means that it is connected in some
way with any or all of these three things, e.g., the unique identifier may be on a label
attached to a container holding the sample, the unique identifier may exist as a field
in a database record containing medical data regarding the source, etc. In essence,
the genetic code of the sample itself, which is unique and forms the basis of the
polymorphisms tested, serves as the unique identifier. Because the unique identifier
is based on the genetic code, which is unique between individuals, the unique
identifier will also be unique between samples from different source individuals.

A "polymorphism" is an allelic variation between two samples. As used
herein, the term includes differences between proteins (e.g., enzymes, blood groups,
blood proteins), differences in the chemicals and biochemicals (e.g., secondary
metabolites) produced by the source organism(s), differences between nucleic acids
involving differences in the nucleotide sequence (e.g., restriction site maps), or
differences in length of a stretch of nucleic acid (e.g., RFLPs (restriction fragment
length polymorphisms), microsatellites, STRs (short tandem repeats), SSRs (simple
sequence repeats), SSLPs (simple sequence length polymorphisms), VNTRs
(variable number tandem repeats)). Allelic variation can also result in phenotypic
(i.e., visually apparent) polymorphisms, or variations in quantitative characters (e.g.,
variation in height, length, yield of fruit, etc.) between the organisms that serve as
the source of the samples. With some types of biological material, phenotypic
differences may be visible in the samples themselves, e.g., kernels of different types
of "Indian" corn often appear very different from each other, with red, yellow, white,
blue, streaked kernels, etc. With such samples, phenotypic polymorphisms could
also be used to produce the unique identifier.

A polymorphism is not limited by the function or effect it may have on the
organism as a whole, and can therefore include allelic differences which may also be
a mutation, insertion, deletion, point mutation, or structural difference, as well as a
strand break or chemical modification that results in an allelic variant. A
polymorphism between two nucleic acids can occur naturally, or be caused
intentionally by treatment (e.g., with chemicals or enzymes), or can be caused by
circumstances normally associated with damage to nucleic acids (e.g., exposure to
ultraviolet radiation, mutagens or carcinogens).

As used herein, a "sequence polymorphism" is a difference in the sequence
of two nucleic acids or two amino acids. Two amino acid sequences can differ by
having different residues at a particular position (i.e., an amino acid substitution),
or some residues may be deleted, or new residues inserted or added to one or more
ends. Two nucleic acids differing in sequence may have the same number of base
pairs (e.g., "ATGC" vs. "ATTC"), but may also include some differences in overall
sequence length as well (e.g., "ATCACATG" vs. "ATCACACATG"). Types of
commonly-studied polymorphisms caused by sequence differences include
restriction site polymorphisms, isozymes, differences in protein conformation, and
length polymorphisms. If the nucleic acid is sequenced, then a sequence difference
itself (as represented by the string of letters) serves as the polymorphism.

As used herein, a "length polymorphism" is a difference in the length of two
nucleic acids. Two different nucleic acids with a length polymorphism between
them also have a sequence polymorphism, but many methods used to detect a length
polymorphism do not reveal the exact sequence polymorphism. Commonly-used
types of length polymorphisms include RFLPs (restriction fragment length
polymorphisms), microsatellites, STRs (short tandem repeats), SSRs (simple
sequence repeats), SSLPs (simple sequence length polymorphisms), and VNTRs
(variable number tandem repeats).

In general, the difference between "length polymorphisms" and "sequence
polymorphisms" lies in the methods used to detect them. With RFLPs, for
example, restriction endonucleases are used to cut a nucleic acid molecule into
fragments, which are then separated on an agarose gel. The differences between two
individuals are measured by the changes in size of the resultant nucleic acid
fragments, and so are referred to as length polymorphisms, yet those differences are
caused by differences in the underlying sequence, which is the basis for the change
in restriction sites, and therefore the changes in the sizes of the nucleic acid
fragments. Because the method of detection/visualization can only differentiate on
the basis of fragment length, the RFLPs are generally classed as length
polymorphisms.

As used herein, a "polymorphic locus" is a segment of nucleic acid which
may contain a polymorphism as described above. It is not required that the precise
sequence of the nucleic acid be known. A polymorphic locus is not limited to those
loci which are polymorphic in all situations; a polymorphic locus which
displays an allelic variation between individuals A and B, but not between
individuals A and C, remains a polymorphic locus for purposes of comparing
individuals A and B, as well as individuals B and C.

"Nucleic acid" means deoxyribonucleic acid (DNA), ribonucleic acid
(RNA), nucleic acids from mammals or other animals, plants, insects, bacteria,
viruses, or other organisms.

By "unique identifier" is meant an identification tag, designation, or code to
be linked to a sample, its source, or other information, such as patient case history,
disease testing results, genetic testing results, geographic or temporal collection data,
or any other information which may be useful when linked with the sample or
source. The unique identifier can exist in the form of an alphanumeric string, a bar
code, an entry in a database, or any other useful human-readable or
machine-readable form.

The extent to which such an identifier will be unique depends on the loci
chosen for polymorphism testing. Allelic polymorphism has been studied for
decades, and there are many genetic systems which have been commonly used in
assessing polymorphism between populations or individuals. These include
classical blood groups, blood proteins, isozymes, distribution of restriction
endonuclease sites, restriction fragment length polymorphisms (RFLPs), and others.
The most successful to date, however, have been microsatellites, also known as short
tandem repeats (STRs), or simple sequence length polymorphisms (SSLPs), or
variable number tandem repeats (VNTRs).

STRs are stretches of DNA that consist of repeated sequences. The
base sequence is usually just a few base pairs long, typically two to twelve base
pairs, but longer base repeats have been seen. This base sequence is then tandemly
repeated, and the number of times it is repeated can vary greatly, depending on the
STR locus being studied. An STR can therefore be expressed as (X)n, where X is the
repeated sequence (e.g., "CA") and n is the number of times that X is repeated.
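The (X)n notation can be illustrated with a minimal Python sketch. The function names `expand_str` and `count_tandem_repeats` and the example sequences are hypothetical, chosen only for illustration; counting repeats by scanning a known sequence is a simplification and is not how STR alleles are sized in the laboratory.

```python
def expand_str(unit: str, n: int) -> str:
    """Write out (X)n: the repeat unit X repeated n times in tandem."""
    return unit * n


def count_tandem_repeats(sequence: str, unit: str) -> int:
    """Return the longest run of back-to-back copies of `unit` in `sequence`."""
    best = 0
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Count consecutive copies of the unit beginning at this position.
        while sequence.startswith(unit, pos):
            run += 1
            pos += len(unit)
        best = max(best, run)
    return best


# Hypothetical locus: flanking bases around a (CA)8 repeat.
allele = "GGT" + expand_str("CA", 8) + "TTG"
print(count_tandem_repeats(allele, "CA"))
```

Two alleles of the same hypothetical locus would then differ only in the value of n returned for each.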

Most individuals in a population will have the same STR at the same
location in the genome, that is, different individuals will have the same base repeat
at the same location, but the precise number of repeats often varies from individual
to individual. For example, for a given STR mapped to a particular location on the
genome, the base sequence may be repeated 5 times in individual A, but may be
repeated 8 times in individual B and 20 times in individual C.

These tandem repeats are believed to be caused by "slippage" of the DNA
polymerase enzyme as the DNA is replicated. In general, n increases over
generations, and the amount of slippage varies over time and in different lines. The
variability of these repeated sequences is generally correlated to the length of the
base repeat, with STRs composed of longer base repeats exhibiting less variability
between individuals than shorter base repeats. For example, a two base pair repeat
may consist of a two base pair unit being repeated hundreds of times in an individual,
while a 12-base pair unit may only be repeated a few times. In general, the amount
of slippage that occurs during replication, and therefore the amount of variability in
the number of repeats that results from that slippage, is also correlated to the length
of the base repeat. Short repeats tend to exhibit higher rates of polymorphism
between individuals, while 10- or 12-base pair repeats may show little or no
variability.

STRs can be amplified and detected by known procedures. For example,
they can be detected by electrophoretic separation followed by radionuclide or
fluorescent labeling, or silver staining. They have many advantages over other
methods of detecting polymorphisms (e.g., RFLPs) because of their small size, the
ease and speed with which they can be detected and analyzed, and the fact that the
process is amenable to automation. The more recent generations of large-scale
genetic maps have been made using STRs (Hudson, T.J., et al., Science 270:1945-1954
(1995); Dietrich, W.F., et al., Nature Genetics 7:220-245 (1994); Yerle, M., et
al., Mamm. Genome 6:176-186 (1995); Jacob, H.J., et al., Nature Genetics 9:63-69
(1995)). Because of the extremely high rate of polymorphism of some of the STR
loci, they are also used in forensic tests by law enforcement agencies.

A number of kits for amplifying STR loci are commercially available, and
the rates of polymorphism of these loci in different ethnic backgrounds are known.
These include AmpFlSTR™ Profiler, AmpFlSTR™ Profiler Plus, AmpFlSTR™
Green I (PE Applied Biosystems, Foster City, California, USA), the Geneprint™
STR Systems (Promega Corp., Madison, WI), including Geneprint™ PowerPlex™
1.1, Geneprint™ PowerPlex™ 1.2, Geneprint™ PowerPlex™ 2, and Geneprint™
PowerPlex™ 16, Sex Determination Systems, and others. These STR systems were
developed for use in humans, but microsatellite markers have been developed in
other organisms, including horse, cattle, sheep, goat, dog, pig, mouse, rat, barley,
corn, soybean, and others.

These loci can be used singly, or can be combined, depending on the power
of discrimination required. As the number of organisms being studied and the
number of individuals from which samples are removed and archived increases, the
degree of polymorphism required to uniquely identify each sample also increases,
and the number of polymorphic loci that need to be tested to have a sufficient
number to create the unique identifier also increases.

WO 99/43855    PCT/US99/04094



For example, if three individuals possessed the following alleles at three
different loci:

                 Locus 1    Locus 2    Locus 3
Individual A     1,3        1,2        3,5
Individual B     2,3        2,2        1,4
Individual C     2,3        1,5        3,5

then detection of the alleles at Locus 1 would allow a sample from Individual A to
be distinguished from a sample from B or C, but samples from B and C could not be
uniquely distinguished from each other, and a second locus would need to be tested.
On the other hand, Locus 2 alone could serve as the unique identifier, because by
itself, it can serve to distinguish between samples from all three individuals.
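The reasoning in this example can be checked mechanically. The following Python sketch encodes the genotypes given above and searches for the smallest sets of loci that separate all three individuals; the names `GENOTYPES`, `profile`, `is_unique`, and `smallest_unique_sets` are illustrative assumptions and do not come from the application.

```python
from itertools import combinations

# Genotypes from the example: each locus carries an unordered pair of alleles.
GENOTYPES = {
    "A": {"Locus 1": (1, 3), "Locus 2": (1, 2), "Locus 3": (3, 5)},
    "B": {"Locus 1": (2, 3), "Locus 2": (2, 2), "Locus 3": (1, 4)},
    "C": {"Locus 1": (2, 3), "Locus 2": (1, 5), "Locus 3": (3, 5)},
}


def profile(individual, loci):
    """Genotype of one individual restricted to the chosen loci."""
    return tuple(tuple(sorted(GENOTYPES[individual][locus])) for locus in loci)


def is_unique(loci):
    """True if the chosen loci give every individual a different profile."""
    profiles = [profile(individual, loci) for individual in GENOTYPES]
    return len(set(profiles)) == len(profiles)


def smallest_unique_sets():
    """Return the smallest locus combinations that distinguish everyone."""
    loci = sorted(next(iter(GENOTYPES.values())))
    for size in range(1, len(loci) + 1):
        hits = [combo for combo in combinations(loci, size) if is_unique(combo)]
        if hits:
            return hits
    return []


print(is_unique(("Locus 1",)))   # B and C share 2,3 at Locus 1
print(smallest_unique_sets())
```

Under these genotypes the search agrees with the text: Locus 1 alone is not sufficient, while Locus 2 alone is.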

The Power of Discrimination (P_D) of a given system of loci is defined as the
probability that two individuals selected at random will differ with respect to that
given system of loci. The P_D is related to the Probability of Identity (P_I) by the
equation

    P_D = 1 - P_I,

where P_I is determined by solving the equation

    P_I = \sum_i X_i^2,

where X_i is the frequency in the population of the ith allele. The allelic frequencies
within different ethnic populations are known for many of the polymorphisms of the
STR loci in the commercially-available kits, so a set of STRs which will provide a
unique identifier for every sample can be chosen, even if the final number of
samples is not known. Combinations of loci can be chosen that have matching
probabilities of less than 1 in several million or more (See, for example, Table 1).
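As a worked illustration of these two equations, the Python sketch below computes the Probability of Identity as the sum of squared allele frequencies and the Power of Discrimination as its complement, exactly as defined above. The allele frequencies are hypothetical values invented for the example, not data from any kit, and the function names are likewise assumptions.

```python
def probability_of_identity(allele_freqs):
    """Probability of Identity: sum of the squared allele frequencies."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return sum(x * x for x in allele_freqs)


def power_of_discrimination(allele_freqs):
    """Power of Discrimination: one minus the Probability of Identity."""
    return 1.0 - probability_of_identity(allele_freqs)


# Hypothetical locus with three alleles at frequencies 0.5, 0.3 and 0.2.
skewed = [0.5, 0.3, 0.2]
# Hypothetical locus with four equally common alleles.
even = [0.25, 0.25, 0.25, 0.25]

print(round(power_of_discrimination(skewed), 4))
print(round(power_of_discrimination(even), 4))
```

With these made-up frequencies the locus with more, evenly distributed alleles has the higher Power of Discrimination, which is the property one looks for when selecting loci.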
Table 1. Matching probabilities of various populations in the Geneprint™ STR
system, using fluorescent detection (Promega Corp., Madison, Wisconsin, USA).

                   Caucasian-American   African-American   Hispanic-American
CTTv Quadriplex    1/6623               1/25575            1/7194
FFFL Quadriplex    1/2632               1/16807            1/3279
Both combined      1/17,400,000         1/430,000,000      1/23,600,000
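The "Both combined" row of Table 1 is consistent with multiplying the matching probabilities of the two quadriplexes. The Python sketch below performs that multiplication with exact fractions; treating the two sets of loci as statistically independent is an assumption made here for illustration, and the names `TABLE_1` and `combined_matching_probability` are hypothetical.

```python
from fractions import Fraction

# "1 in N" matching probabilities from Table 1: (CTTv, FFFL) per population.
TABLE_1 = {
    "Caucasian-American": (6623, 2632),
    "African-American": (25575, 16807),
    "Hispanic-American": (7194, 3279),
}


def combined_matching_probability(*one_in):
    """Multiply 1/N matching probabilities, assuming the systems are independent."""
    result = Fraction(1)
    for denominator in one_in:
        result *= Fraction(1, denominator)
    return result


for population, denominators in TABLE_1.items():
    p = combined_matching_probability(*denominators)
    print(f"{population}: 1 in {p.denominator:,}")
```

The products come to roughly 1 in 17.4 million, 1 in 430 million, and 1 in 23.6 million, which agrees with the combined values reported in the table after rounding.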



Once the polymorphism rates are known for a series of loci, one can choose
which loci and how many will go into making up the unique identifier. If it is
anticipated that the final number of samples will be relatively small, then only a few
loci are sufficient to form the unique identifier, and one or more of the loci may not
need to have a high P_D. On the other hand, if one intends to store a very large
number of samples, then it would be prudent to use more loci, each with a high P_D.
The loci selected will be based on considerations such as P_D, anticipated size of the
repository, ease of use, applicability to the organisms being sampled, cost, and
availability.
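One rough way to relate repository size to the number of loci needed is to estimate how many pairs of unrelated samples would be expected to share an identifier by chance. The Python sketch below uses the number of distinct pairs, N(N-1)/2, multiplied by the matching probability of the chosen loci. This pairwise estimate is an assumption introduced here for illustration and is not a calculation given in the application; it also ignores relatedness between sources.

```python
def expected_matching_pairs(n_samples, matching_probability):
    """Expected number of sample pairs sharing an identifier purely by chance."""
    pairs = n_samples * (n_samples - 1) // 2
    return pairs * matching_probability


# Matching probabilities taken from Table 1 (Caucasian-American column).
one_quadriplex = 1 / 6623
both_quadriplexes = 1 / 17_431_736

for n in (100, 10_000, 100_000):
    print(
        n,
        round(expected_matching_pairs(n, one_quadriplex), 2),
        round(expected_matching_pairs(n, both_quadriplexes), 4),
    )
```

On this estimate a single quadriplex is adequate for about a hundred samples, whereas a repository of 100,000 samples would still be expected to contain coincidental matches even with both quadriplexes combined, which is in line with the advice to add loci as the repository grows.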

Polymorphisms used

The polymorphisms that can be used in the invention will vary depending on
the types of samples being stored. STRs are well-studied in humans, and kits are
commercially available for amplifying a number of STR loci. Genetic maps based
on STRs have been built for other organisms (e.g., mouse, rat, pig). STRs appear to
exist in most higher organisms, and are easy to isolate and characterize. Because the
methods used to identify and assess STRs are virtually identical for different
organisms, one skilled in the art can isolate STRs in an organism of choice, assess
the polymorphism rates, and choose those most useful in the present invention. For
many organisms, STRs and their primer sequences have been published in the
scientific literature. One wishing to use previously published STRs need only order
those primers (e.g., custom primers can be ordered and received in 48 hours from
Research Genetics, Huntsville, Alabama, USA), and then use them to amplify the
STRs in the DNA of the collected samples.

It is not necessary to use commercially-available primers to practice the
present invention, nor is it necessary to use microsatellite markers developed by
others. The present invention allows one to use any polymorphic marker that is
convenient, so long as it provides a power of discrimination between individuals.
There are many species worthy of study for which no genetic map exists. One of the
reasons that microsatellite markers have become so successful is that they are easy
to develop for previously unstudied organisms. One already familiar with an
organism for which there are no microsatellite markers can develop them with
relative ease using methods well-known in the art.

Uses of the invention

The method described herein will be of particular use in a pathology
laboratory or testing facility, or a large-scale cryogenic repository. Maintaining the
integrity of the sample labels is of paramount importance in these situations, as
quality control problems often result from failure of the record-keeping system.
Naturally, such a method will also be of use to blood banks, tissue banks, and
veterinary hospitals and testing facilities.

The method can also be used by large repositories to identify misplaced or
misidentified samples. For example, a tissue bank may take a small piece (e.g., a
sample) of a stored tissue (e.g., a source) for testing (e.g., tissue typing for a
potential recipient of the tissue). If the identification were disassociated from the
sample (e.g., the label fell off the test tube), those test results would normally be
lost. Using the unique identifier of the present invention, however, one would
simply test the sample for the polymorphic loci, and recreate the unique identifier.
The sample (and the tissue typing test results) could then be reassociated with the
source in storage.
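The recovery step described here can be sketched in a few lines of Python. The sketch assumes one possible encoding of the identifier (loci in sorted order, alleles sorted within each locus, joined into an alphanumeric string); the locus names, the string format, and the helper names `make_identifier`, `deposit`, and `reassociate` are all hypothetical, since the application leaves the encoding open.

```python
def make_identifier(genotype):
    """Build a deterministic alphanumeric identifier from typed loci."""
    parts = []
    for locus in sorted(genotype):
        first, second = sorted(genotype[locus])
        parts.append(f"{locus}:{first}/{second}")
    return "|".join(parts)


# Storage records keyed by the identifier derived from the stored source.
storage = {}


def deposit(genotype, record):
    """Store a record under the identifier computed from the genotype."""
    identifier = make_identifier(genotype)
    storage[identifier] = record
    return identifier


def reassociate(retyped_genotype):
    """Re-type an unlabeled sample and look up the record it belongs to."""
    return storage.get(make_identifier(retyped_genotype))


deposit({"LOCUS_A": (7, 9), "LOCUS_B": (12, 12)}, "freezer 3, rack 14")
# The label is lost; the sample is typed again, with alleles read in any order.
print(reassociate({"LOCUS_B": (12, 12), "LOCUS_A": (9, 7)}))
```

Sorting both the loci and the alleles is the design choice that makes the identifier reproducible: the same sample yields the same string regardless of the order in which results are recorded.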

The method is especially useful to maintain the long-term integrity of
samples and associated information, especially in tissue repositories. Many
biomedical studies require analysis of tissue samples from large populations of
individuals with known medical, dietary, genetic, social, and cultural backgrounds.
During the course of a study requiring several years to complete, it may be necessary
to test a particular sample several times. It is therefore vital to the accuracy of the
study to confidently re-retrieve that sample.

Many biomedical studies involve analysis of a group of individuals with a set
of characteristics in common (e.g., cigarette smoking, ethnic background, incidence
of particular cancers). At present, the amount of time and effort involved in
assembling a set of individuals appropriate for a study may be greater than the effort
of conducting the study itself. If samples from a large number of individuals could
be collected in a repository along with associated data on the individuals, then it
should be possible to "assemble" a set of individuals for a given study by selecting
samples from these individuals chosen on the basis of a defined set of
characteristics. For example, if a blood sample repository contained samples from
100,000 individuals, and associated medical data on those same individuals in
computerized form, then a medical study could be conducted by selecting
individuals with desired characteristics (as listed in the computerized medical data),
and then retrieving samples (or more likely, sub-samples) from those individuals,
which are held in the repository. The method of the present invention is useful in
establishing such a repository, because the method greatly reduces the likelihood of
samples being misidentified and allows confident re-retrieval of samples.

Another advantage of the invention is that if the unique identifier is
de-associated from the sample within the repository (e.g., the label falls off the tube),
analysis of the polymorphisms in the sample allows re-creation of the unique
identifier.

Use of the method of the invention also provides a method for preventing
repository deposit of samples from duplicate sources, because when the unique
identifier is created for a sample from a new source, one need only search the
repository records for that same unique identifier to see if the source is already
represented by a sample in the repository.
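
As a rough sketch of the record search described above, the repository records could be held in a mapping keyed by unique identifier. The class `Repository` and its methods are hypothetical names chosen for illustration; a real repository would more likely use a database. The same lookup serves when an identifier has been re-created from a sample that lost its label.

```python
class Repository:
    """Toy in-memory repository keyed by unique identifier (illustrative only)."""

    def __init__(self):
        self._by_id = {}

    def deposit(self, unique_id, sample_info):
        """Store the sample unless its source is already represented.

        Returns True if the sample was stored, and False if a sample with
        the same unique identifier is already on record.
        """
        if unique_id in self._by_id:
            return False
        self._by_id[unique_id] = sample_info
        return True

    def lookup(self, unique_id):
        """Return the stored information for an identifier, or None if absent."""
        return self._by_id.get(unique_id)


repo = Repository()
first = repo.deposit("A35B27", {"type": "blood", "freezer": 3})
second = repo.deposit("A35B27", {"type": "blood", "freezer": 7})
print(first, second)
```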

The invention also has uses outside of the medical field. Because of the
increasing ease with which samples from various sources (e.g., plant, animal,
microbial, fungal, viral) can be tested for polymorphisms, the invention is applicable
in any situation where a large number of biological samples may be stored. An
example of such a situation would be a field study of biodiversity of a wild
population. New tests for assessing diversity (i.e., assessing polymorphism) are
continually being created, and samples from previous collecting expeditions
represent a "snapshot" of the biodiversity that existed in the past. Such
previously-collected samples can be re-tested using current techniques, but the results are only
useful if the integrity of the sample designations is still sound, and the samples can
be linked to their original collection data. Maintenance of quality of the record
keeping is especially important if the field samples are from species which are
endangered or extinct. The method of the invention provided here has potential uses
in studies of population genetics, evolutionary genetics, and ecology. In studies of
flora and fauna from locales that are either increasing or decreasing in pollution, for
example, it is necessary to both store the samples for a period of time and also
maintain their identification. Such sampling at periodic intervals is also a
requirement of an effective bioremediation plan.

There are many situations where one would want to keep a biological sample
for a period of time against the possibility of testing it again later. For example,
even if one has conducted a population genetics study on a series of samples
(collection of organisms), a new test developed at a future time may allow the
testing of different hypotheses, and provide the answer to new questions, without
necessitating collection of new samples in the field. Therefore, the method of the
invention described herein would be especially useful in maintaining collections of
samples from endangered species. The unique identifier and identity of each sample
can be re-verified from the sample itself.

This sample identification method can be used to keep track of samples in
any study or collection where there are a large number of biological samples being
stored for a period of time, and where there is a chance that samples may become
misplaced or mislabeled.



EXAMPLES

Example 1: Use of STRs to Create a Unique Identifier
A biological sample is obtained from a human, and an aliquot is taken for
polymorphism testing. DNA is isolated by methods well known in the art (Maniatis
et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory
Press, New York; Ausubel, F.M. et al., eds., Current Protocols in Molecular
Biology). An amount of this isolated DNA is removed, GenePrint™ primers
(Promega Corp., Madison, WI) for the CSF1PO locus are added to it, and
amplification is carried out, all according to the manufacturer's instructions
(supplemental information on thermocycling is well known in the art, see e.g.,
Innis, M.A., et al. (1990) PCR Protocols: A Guide to Methods and Applications,
Academic Press, Inc. San Diego, CA). After amplification, fluorescence detection is
carried out, also according to the manufacturer's recommendations. The process is
repeated for the other loci in the CTTv Quadriplex (TPOX, TH01, vWA), and also
the four loci in the FFFL Quadriplex (F13A01, FESFPS, F13B, and LPL).

Once detection of the polymorphism(s) is complete, the unique identifier can
be created for the sample. For a system such as these eight loci, where the alleles
are 3 to 7 repeats in length, a convenient conversion method is to simply list each
locus by letter, followed by the two alleles for that locus. For a sample with alleles
of 3 and 5 tandem repeats at the first locus, alleles of 2 and 7 repeats at the second,
etc., the unique identifier would be "A35B27...HXY". The precise conversion
method could be varied, depending on the number of repeats in the loci, e.g., a locus
with 3-12 repeats would require 4 digits after the locus letter.
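
The conversion just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `make_identifier` and the parameter `digits_per_allele` are hypothetical, loci are lettered in the order supplied, and the two alleles at a locus are sorted so that the same genotype always yields the same string (the text does not specify an ordering).

```python
import string


def make_identifier(genotypes, digits_per_allele=1):
    """Build an identifier such as "A35B27" from an ordered list of allele pairs.

    genotypes: list of (allele1, allele2) repeat counts, one pair per locus,
        always supplied in the same fixed locus order.
    digits_per_allele: 1 when every repeat count is a single digit; 2 when
        counts of 10 or more can occur, giving 4 digits per locus.
    """
    if len(genotypes) > len(string.ascii_uppercase):
        raise ValueError("more loci than available letters")
    parts = []
    for letter, pair in zip(string.ascii_uppercase, genotypes):
        # Assumption: sort the pair so that (5, 3) and (3, 5) code identically.
        low, high = sorted(pair)
        for allele in (low, high):
            if not 0 <= allele < 10 ** digits_per_allele:
                raise ValueError(f"allele {allele} does not fit the digit width")
        width = digits_per_allele
        parts.append(f"{letter}{low:0{width}d}{high:0{width}d}")
    return "".join(parts)


print(make_identifier([(3, 5), (2, 7)]))                # A35B27
print(make_identifier([(3, 12)], digits_per_allele=2))  # A0312
```

Fixed-width fields keep the string unambiguous without separators, which is why a locus with 3-12 repeats needs four digits rather than two.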

Example 2: Preparation of DNA from Samples of Whole Blood 

Red blood cells lack DNA because they are enucleated, and must therefore
be lysed to facilitate their separation from white blood cells, which contain genomic
DNA. After the red blood cells are lysed and removed, the white blood cells are
then lysed with an anionic detergent in the presence of a DNA stabilizer, which
limits the activity of DNase. Contaminating RNA is then degraded with RNase, and
the RNA, proteins, and other contaminants are then removed by salt precipitation.
The genomic DNA is recovered by alcohol precipitation, dissolved in TE buffer, and
stored. Because the genomic DNA will be used in a nucleic acid amplification
method, it is advisable to also have a "blank" control tube (containing reagents but
no blood) accompany the blood sample tube through the extraction process. After
extraction, the "DNA" from the "blank" control tube would be amplified to ensure
that no extraneous DNA has contaminated the extraction process.

Isolation of genomic DNA from whole blood can be accomplished by
following any of a variety of protocols, including using the PUREGENE® kit
(Gentra Systems, Minneapolis, Minnesota, USA), and following the manufacturer's
instructions. Place 30 ml of RBC Lysis Solution into a 50 ml tube, add 7 ml to 10
ml of whole blood, mix by inverting several times, and incubate for 10 minutes at
room temperature. Invert the tube again at least once during the incubation.
Centrifuge the tube for 10 minutes at 2,000 x g, pour off the supernatant, leaving
behind the visible white cell pellet and about 200 µl of residual liquid. Vortex the
tube vigorously for 20 seconds to resuspend the cells in the residual liquid. Add 10
ml of Cell Lysis + RNase A (made fresh that day), and vortex on high speed for 10
seconds. Incubate the tube at 37°C for 15 to 30 minutes to allow digestion of the
RNA.

Cool the sample to room temperature by placing in an ice bath for 10
minutes. Add 3.33 ml of the Protein Precipitation Solution (Gentra Systems,
Minneapolis, Minnesota, USA) into the tube. Vortex at high speed for 20 seconds to
mix uniformly. Centrifuge at 2,000 x g for 10 minutes. If a tight, dark brown pellet
is not formed, repeat the 20-second vortex, followed by a 5-minute incubation on
ice, and repeat the 10-minute centrifugation at 2,000 x g.

Pour off the supernatant into a clean 50 ml tube containing 10 ml of 100%
isopropanol. Mix by inverting gently 50 times (do not vortex, or the DNA will be
sheared). The DNA is stable at this point, and can be stored indefinitely in the
isopropanol.

Centrifuge at 2,000 x g for 3 minutes. Carefully pour off the supernatant,
leaving behind the white pellet, and drain the tube upside down on clean absorbent
paper. Add 10 ml of 70% ethanol, and wash the pellet by inverting gently, avoiding
dislodging the pellet. Centrifuge at 2,000 x g for 1 minute. Carefully pour off the
ethanol, leaving the pellet behind. Invert carefully so as to not dislodge the pellet,
and drain the tube on clean absorbent paper for 10 minutes.

Add 1 ml of DNA Hydration Solution (Gentra Systems, Minneapolis,
Minnesota, USA), and rehydrate the DNA by incubating at 65°C for 1 hour,
overnight at room temperature, and at 65°C for 1 hour the next day. Tap the tube
periodically to help disperse the DNA. The DNA in solution can be stored
indefinitely at 4°C.

Example 3: Preparation of Blood Sample With FTA™ Paper

A blood sample is drawn from a human. Two µl of blood are placed on a
piece of FTA™ paper (FITZCO, Inc., Maple Plain, Minnesota, USA), dried, and
stored until ready to be processed.

To analyze the polymorphisms in the sample, a 1 mm disc is punched
directly into a 2 ml microcentrifuge tube, and 200 µl of FTA™ purification reagent
is placed on the disc. The tube is capped, vortexed for 3-5 seconds, then centrifuged
in a microcentrifuge at 12,000 x g for 30 seconds. The wash solution is then
aspirated and discarded. The wash is then repeated with another 200 µl of
purification reagent. After the second wash solution has been aspirated and
discarded, the disc is washed twice with TE as follows: 200 µl of TE buffer is
added, and the disc vortexed for 3-5 seconds; the tube and disc are then centrifuged
at 12,000 x g for 30 seconds and the filtrate removed and discarded. After the disc
has been washed twice with TE, the disc is subjected to polymorphism analysis.

Example 4: Analysis of Polymorphisms in a Blood Sample 

A 1 mm punch of FTA™ paper containing a blood sample, processed as
described in Example 3, supra, is placed in a 0.5 ml tube, and tested with the
AmpFlSTR Profiler Plus™ system (Perkin Elmer Applied Biosystems, Foster City,
California, USA), according to the manufacturer's instructions. In general, to the
tube is added 10.5 µl of Profiler Plus Reaction Mixture, 0.5 µl of Taq Gold, and 5.5
µl of Primer Mixture. The tube is sealed, and placed in a thermocycler under the
following conditions: 95°C for 11 minutes, followed by 24 cycles of: 94°C for 1
minute, 59°C for 1 minute, 72°C for one minute. After the 25th cycle, the reaction
mixture is placed at 60°C for up to 83 minutes. After thermocycling is complete,
the reaction is held at 4°C until ready for gel electrophoresis.

Example 5: Analysis of Amplification Products by Gel Electrophoresis 
Five µl of amplification product (produced as described above in Example 4)
are mixed with 5 µl of 2X loading buffer (0.25% Bromphenol Blue, 12.5% Ficoll
400, 50 mM EDTA, 5X TAN (10X TAN: 0.4 M Tris, 40 mM Na Acetate
Trihydrate, 10 mM EDTA, pH to 7.9 with acetic acid)). The 10 µl mixture is loaded
into a well in a 1% agarose gel prepared with TE buffer and containing 0.5 µg of
ethidium bromide per ml of agarose gel. An appropriate size ladder is also loaded on
the gel. The gel is then electrophoresed in TAE buffer for 1 hour at 100 volts, and
then illuminated with UV light on a transilluminator, and photographed. The bands
in the photograph are then compared to the literature supplied by the manufacturer to
determine the precise alleles present in the sample.

Example 6: Creation of the Unique Identifier

A set of blood samples was prepared and tested with the AmpFlSTR Profiler
Plus™ system as described in Examples 2 through 4, and the results are shown in
Table 2.

Table 2. Alleles found in seven human individuals when tested for eight STR loci in
the AmpFlSTR Profiler Plus™ system (PE Applied Biosystems, Foster City,
California, USA).

Locus       #1        #2        #3        #4        #5        #6        #7
D3S1358     15        14,15     15,18     16,17     15,17     17,18     15
vWA         15,18     15,16     13,14     16,17     15,20     17,19     14,15
FGA         19,24     20,21     22,24     23        21,22     21,23     24,28
Amelogenin  X,Y       X,Y       X,Y       X,Y       X         X         X
D8S1179     12,15     13,15     12,14     13,15     14        13        14,15
D21S11      33.2      29,30     30,31.2   29,32     28,31     28,33.2   32.2,38
D5S818      12        11,13     8,12      10,12     13,14     11        12,13
D13S317     12,13     11,13     11,12     9,12      12,14     9,14      12
D7S820      8,9       8,12      10,11     8,11      10,12     10,11     9,10
D18S51      11,15     9,14.2    13.2,20   18,24     9,18      10,12     10,18


The polymorphism data for a sample can be coded in a number of ways. The
raw data for individual #1, for example, is as follows:

D3S1358 15;vWA 15,18;FGA 19,24;Amelogenin X,Y;D8S1179 12,15;
D21S11 33.2;D5S818 12;D13S317 12,13;D7S820 8,9;D18S51 11,15

This data can be used "raw" as the unique identifier (i.e., "as is," as above), with no
alteration. For repositories with very large numbers of samples, this may be
desirable, as it is the most "foolproof" method.

Alternatively, the STR loci can be "coded," i.e., each locus represented by a
combination of numbers or letters. D3S1358 can be represented by "A" or "01,"
vWA by "B" or "02," etc. The raw data so coded would then be:

A,15,B,15,18,C,19,24,D,X,...
01,15,02,15,18,03,19,24,...
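
For illustration only, the raw and coded forms for individual #1 can be generated from the Table 2 genotype as sketched below. The function names, the data layout, and the assumption that locus codes follow the row order of Table 2 (D3S1358 = "A"/"01", vWA = "B"/"02", and so on) are hypothetical choices made for this sketch.

```python
# Genotype of individual #1 as listed in Table 2, in table row order.
PROFILE_1 = [
    ("D3S1358", ["15"]),
    ("vWA", ["15", "18"]),
    ("FGA", ["19", "24"]),
    ("Amelogenin", ["X", "Y"]),
    ("D8S1179", ["12", "15"]),
    ("D21S11", ["33.2"]),
    ("D5S818", ["12"]),
    ("D13S317", ["12", "13"]),
    ("D7S820", ["8", "9"]),
    ("D18S51", ["11", "15"]),
]


def raw_identifier(profile):
    """Join 'locus alleles' entries with semicolons, as in the raw listing."""
    return ";".join(f"{locus} {','.join(alleles)}" for locus, alleles in profile)


def coded_identifier(profile, style="letter"):
    """Replace each locus name by a letter (A, B, ...) or a number (01, 02, ...)."""
    fields = []
    for index, (_locus, alleles) in enumerate(profile):
        # Assumption: codes are assigned by position in the profile.
        code = chr(ord("A") + index) if style == "letter" else f"{index + 1:02d}"
        fields.append(code)
        fields.extend(alleles)
    return ",".join(fields)


print(raw_identifier(PROFILE_1))
print(coded_identifier(PROFILE_1, style="letter"))
print(coded_identifier(PROFILE_1, style="number"))
```

Because the locus order is fixed, either coded form can be converted back to the raw form without loss.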

All patents, patent applications, and references cited above are hereby
incorporated by reference in their entirety. While this invention has been
particularly shown and described with references to preferred embodiments thereof,
it will be understood by those skilled in the art that various changes in form and
details may be made therein without departing from the spirit and scope of the
invention as defined by the appended claims.
CLAIMS



What is claimed is: 



1. A method for producing a unique identifier for a biological sample, the
method comprising:

(a) detecting one or more polymorphisms within the biological sample;
and

(b) selecting one or more polymorphisms sufficient to form a unique
identifier;

thereby producing a unique identifier for a biological sample.



2. The method of Claim 1, wherein the biological sample is taken from an
organism selected from the group consisting of: vertebrates, invertebrates,
plants, and microorganisms.

3. The method of Claim 1, wherein the biological sample is from a mammal. 

4. The method of Claim 3, wherein the mammal is a human. 



5. The method of Claim 3, wherein the biological sample is selected from a
group consisting of blood, saliva, hair, body fluid, tissues, organs, and one or
more cells.



6. The method of Claim 5, wherein the polymorphism is selected from the
group consisting of nucleic acid polymorphisms, protein polymorphisms,
enzyme polymorphisms, chemical polymorphisms, biochemical
polymorphisms, phenotypic polymorphisms, and quantitative
polymorphisms.

7. The method of Claim 6, wherein the polymorphism is a nucleic acid
sequence polymorphism.

8. The method of Claim 7, wherein the polymorphism is a nucleic acid length 
polymorphism. 

9. The method of Claim 8, wherein the polymorphism is a short tandem repeat
(STR).

10. The method of Claim 1, wherein the unique identifier is also linked to the
source of the biological sample.

11. The method of Claim 1, wherein the unique identifier is also linked to
relevant information about the biological sample or the source of the
biological sample.

12. The method of Claim 1, wherein the unique identifier is selected from the 
group consisting of an alphanumeric string, and a bar code. 

13. A method for establishing a repository containing a collection of biological
samples, wherein each biological sample has a unique identifier associated
with it, the method comprising:

(a) obtaining a biological sample from a source;

(b) detecting one or more polymorphisms in the sample;

(c) selecting one or more polymorphisms sufficient to form a unique
identifier;

(d) using the unique identifier to identify the sample;

(e) storing the sample with the unique identifier; and

(f) repeating steps (a) through (e) for biological samples from other
sources;

thereby establishing a repository containing a collection of biological
samples, wherein each such biological sample has a unique identifier
associated with it.

14. The method of Claim 13, wherein the samples are DNA-containing samples.

15. The method of Claim 13, wherein the sample source is a human.

16. The method of Claim 13, wherein the polymorphism is selected from the
group consisting of nucleic acid polymorphisms, protein polymorphisms,
enzyme polymorphisms, chemical polymorphisms, biochemical
polymorphisms, phenotypic polymorphisms, and quantitative
polymorphisms.

17. The method of Claim 16, wherein the polymorphism is a short tandem repeat 
(STR). 

18. The method of Claim 13, wherein the unique identifier is selected from the 
group consisting of an alphanumeric string, and a bar code. 

19. The method of Claim 13, wherein the unique identifier is also linked to the
source of the biological sample.

20. The method of Claim 13, wherein the unique identifier is also linked to 
relevant information about the biological sample or the source of the 
biological sample. 

21. A method of determining, by means of a unique identifier, if a source is
represented by a sample within the repository of Claim 15, the method
comprising:

(a) obtaining a sample from the source;

(b) detecting one or more polymorphisms in the sample;

(c) selecting one or more polymorphisms sufficient to form a unique
identifier, wherein the polymorphisms selected are those used to form
unique identifiers for the samples within the repository;

(d) comparing the unique identifier of (c) to the unique identifier of each
sample in the repository;

wherein shared identity between the unique identifier of (c) and a unique
identifier of a sample in the repository indicates that the source is represented
by a sample within the repository.

22. The method of Claim 21, wherein the samples are DNA-containing samples.

23. The method of Claim 21, wherein the source of the sample is a human. 

24. The method of Claim 21, wherein the polymorphism is selected from the
group consisting of nucleic acid polymorphisms, protein polymorphisms,
enzyme polymorphisms, chemical polymorphisms, biochemical
polymorphisms, phenotypic polymorphisms, and quantitative
polymorphisms.

25. The method of Claim 24, wherein the polymorphism is a short tandem repeat
(STR).

26. The method of Claim 21, wherein the unique identifier is also linked to the
source of the biological sample.



27. The method of Claim 21, wherein the unique identifier is also linked to
relevant information about the biological sample or the source of the
biological sample.

28. The method of Claim 21, wherein the unique identifier is selected from the
group consisting of an alphanumeric string, and a bar code.



29. A method for linking, by means of a unique identifier, a member of a first
group with a member of a second group, wherein the first group comprises a
biological sample lacking a unique identifier, and a source of a biological
sample lacking a unique identifier, and wherein the second group comprises
a biological sample having a unique identifier, a source of a biological
sample having a unique identifier, or information having a unique identifier,
the method comprising:

(a) detecting one or more polymorphisms in the member of the first
group;

(b) selecting one or more polymorphisms sufficient to form a unique
identifier, wherein the polymorphisms selected are those used to form
the unique identifier for the members of the second group;

(c) comparing the unique identifier of the member of the first group to
the unique identifier of the member of the second group;

wherein shared identity between the unique identifier of the member of the
first group and the unique identifier of the member of the second group links the
member of the first group with the member of the second group.